{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0ac41b5c-8c03-4ad6-895f-a7847cd5f160",
   "metadata": {},
   "source": [
    "# semantic chunk\n",
    "\n",
    "根据向量模型来进行划分\n",
    "- https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb\n",
    "- https://python.langchain.com/v0.2/docs/how_to/semantic-chunker/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd4c1f06-2076-4fd0-ae1b-b1a77a91629d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "dd13b132-f432-4231-a063-d024d0227a1d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_experimental.text_splitter import SemanticChunker\n",
    "from langchain_community.embeddings import HuggingFaceEmbeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7c7c2894-25fd-48cb-a2fc-53c52b882c8c",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\9N3GPC3\\AppData\\Local\\Temp\\ipykernel_5616\\2437591668.py:4: LangChainDeprecationWarning: The class `HuggingFaceEmbeddings` was deprecated in LangChain 0.2.2 and will be removed in 1.0. An updated version of the class exists in the langchain-huggingface package and should be used instead. To use it run `pip install -U langchain-huggingface` and import as `from langchain_huggingface import HuggingFaceEmbeddings`.\n",
      "  hf_embedding = HuggingFaceEmbeddings(\n",
      "D:\\python311\\Lib\\site-packages\\sentence_transformers\\cross_encoder\\CrossEncoder.py:11: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n",
      "  from tqdm.autonotebook import tqdm, trange\n"
     ]
    }
   ],
   "source": [
    "model_name = \"../../bge-small-zh-v1.5\"\n",
    "model_kwargs = {'device': 'cpu'}\n",
    "encode_kwargs = {'normalize_embeddings': True}\n",
    "hf_embedding = HuggingFaceEmbeddings(\n",
    "    model_name=model_name,\n",
    "    model_kwargs=model_kwargs,\n",
    "    encode_kwargs=encode_kwargs\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "316f10cb-5ad2-4c25-8121-d9e6a744ad19",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"\"\"蔚来萤火虫即将交付，李斌所预测的两年内新能源车渗透率超80%能否实现？\n",
    "\n",
    "9 月 6 日消息，近日，蔚来汽车创始人、董事长李斌在蔚来九周年内部讲话及财报电话会上透露了多项重要信息，其中最引人注目的莫过于第三品牌萤火虫将于 2025 年正式交付，并预测新能源汽车的市场渗透率将在未来两年内超过 80%。\n",
    "\n",
    "据李斌介绍，蔚来汽车将形成三个品牌矩阵，覆盖从 14 万元到 80 万元的广阔市场区间。其中，第三品牌萤火虫作为蔚来汽车布局中低端市场的重要棋子，将于 2025 年正式交付。这一举措不仅丰富了蔚来的产品线，也进一步提升了其在新能源汽车市场的竞争力。\n",
    "\n",
    "This chunker works by determining when to \"break\" apart sentences. This is done by looking for differences in embeddings between any two sentences. When that difference is past some threshold, then they are split.\n",
    "\n",
    "There are a few ways to determine what that threshold is, which are controlled by the breakpoint_threshold_type kwarg.\"\"\"\n",
    "text_splitter = SemanticChunker(hf_embedding, breakpoint_threshold_type='gradient')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "0303f54e-b7ba-4869-997b-4e85518db048",
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb\n",
    "# https://python.langchain.com/v0.2/docs/how_to/semantic-chunker/\n",
    "docs = text_splitter.create_documents([text])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "31f905ee-847d-40cc-8fdb-672b222acfb9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "2a7ec8ff-b3d9-4325-8661-2511da672d9a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='蔚来萤火虫即将交付，李斌所预测的两年内新能源车渗透率超80%能否实现？\\n\\n9 月 6 日消息，近日，蔚来汽车创始人、董事长李斌在蔚来九周年内部讲话及财报电话会上透露了多项重要信息，其中最引人注目的莫过于第三品牌萤火虫将于 2025 年正式交付，并预测新能源汽车的市场渗透率将在未来两年内超过 80%。\\n\\n据李斌介绍，蔚来汽车将形成三个品牌矩阵，覆盖从 14 万元到 80 万元的广阔市场区间。其中，第三品牌萤火虫作为蔚来汽车布局中低端市场的重要棋子，将于 2025 年正式交付。这一举措不仅丰富了蔚来的产品线，也进一步提升了其在新能源汽车市场的竞争力。\\n\\nThis chunker works by determining when to \"break\" apart sentences.'),\n",
       " Document(page_content='This is done by looking for differences in embeddings between any two sentences. When that difference is past some threshold, then they are split. There are a few ways to determine what that threshold is, which are controlled by the breakpoint_threshold_type kwarg.')]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b50a76da-9edf-4b0f-a3a4-7a1267e67598",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "55d2810d-5e15-4d73-880c-29c04aa6f84e",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0rc2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
